Abstract

This paper presents two algorithms for computing the minimum spanning
forest of an input graph on a fat-tree architecture. One algorithm is
deterministic, and the other probabilistic. The deterministic algorithm
generates O(log^3 |V|) message sets, each of which can be delivered in
O(β(G)) delivery cycles. The probabilistic algorithm generates
O(log^2 |V|) message sets, each of which can be delivered in O(β(G))
delivery cycles.

1  Introduction


In  this  paper  we  present  a  parallel  algorithm  for  computing  the  minimum 
spanning  forest  of  a  graph  on  a  fat-tree  architecture.  That  is,  given  graph 
G  =  (V,  E)  where  the  edges  in  E  are  weighted,  we  want  to  find  a  set  of 
edges  forming  a  minimum  spanning  tree  for  each  connected  component.  We 
will analyze the running time not only in terms of |V| and |E|, but also in
terms  of  how  efficiently  the  graph  has  been  embedded  in  the  fat-tree. 

Parallel algorithms for computing the connected components or the
minimum spanning forest of an input graph have been presented for
numerous parallel architectures [AS, B, H, HCS, KR, SV]. Awerbuch and
Shiloach, for example, have presented a minimum spanning forest algorithm
with running time O(log |V|) [AS]. Their algorithm is intended for a PRAM
(Parallel Random Access Memory) model with CRCW (Concurrent Read and
Concurrent Write) capabilities. Each of the |E| + |V| processors in this
model has access to every word of a shared memory. While this model is
very powerful, the connectivity required to build such a shared memory is
so high that it may be impractical except on a small scale. Other authors
have presented minimum spanning tree algorithms for less highly connected
but also less general architectures. In particular, Bentley has presented
a minimum spanning tree algorithm for a specialized tree architecture
containing |V| processors [B]. Bentley's algorithm has running time
O(|V| log |V|). In this paper we present a minimum spanning forest
algorithm for a new class of universal routing networks, introduced in
[L], called fat-trees.

Leiserson has shown that under the assumption that only O(A) bits may
enter or leave a region R with surface area A in unit time, fat-trees have
the following universality property: given any routing network R
consisting of some fixed amount of hardware (a set P of processing
elements wired together in volume V), there exists a fat-tree built with
the same amount of hardware that can simulate the original network at a
cost of a factor of O(log^3 |P|) in time. Thus for a given amount of
hardware, a fat-tree can in theory be used to solve a problem, such as
computing the minimum spanning forest of a graph, in almost optimal time.
Leiserson's theorem indicates that fat-trees are a powerful class of
routing networks. His paper, however, explains only how to simulate other
routing networks and says nothing about how to design efficient algorithms
specifically for the fat-tree. In this paper we describe new data
structures and techniques that may be useful in future fat-tree
algorithms.

We also introduce a new parameter to the running times of parallel
algorithms. The running times of sequential and parallel algorithms are
typically parameterized by the size of the input. For example, the two
parallel algorithms mentioned above have running times parameterized by
the number of vertices and edges in the input graph G. The new parameter,
which we will call the base load factor of G, β(G), is a measure of the
communication congestion that occurs when some primitive operation is
performed in parallel on the input data. We will embed each vertex v ∈ G
in a different processor, and our primitive operation will be for each
vertex to simultaneously pass a message to each neighboring vertex. In
this algorithm, the communication congestion of every message set, and
consequently of the entire algorithm, can be expressed in terms of β(G).

The remainder of this paper is organized in the following manner. In
section 2 we define a fat-tree architecture and the concepts of message
sets and their load factors. In sections 3 and 4 we describe the message
set routing results of Leiserson and Greenberg [LG] and prove a short
lemma extending these results. In section 5 we describe the parallel
minimum spanning forest algorithm that we are going to implement. Our
implementation requires the auxiliary data structures and subalgorithms
that are described in section 6. Following these descriptions, we present
our minimum spanning forest algorithm in section 7. In sections 8 and 9
we present three more subalgorithms of the minimum spanning forest
algorithm. Section 10 is an analysis of the running time of the algorithm.
A message set synchronization scheme using the ideas of this paper is
described in section 11. We conclude with a few comments on future
fat-tree research.


2  Fat-Trees 

A fat-tree is depicted in Figure 1. The underlying structure of a
fat-tree is a complete binary tree. The leaves of the binary tree are
processor elements, the internal nodes are switches, and the edges are
communication channels. In general, the capacities of the communication
channels increase as the tree is traversed from the leaves to the root.
More formally, a fat-tree is an ordered triple FT = (P, N, C) where P is
the set of processors found at the leaves, N is the set of switches found
at the internal nodes, and C is the set of channels found at the edges.
We let cap(c) denote the capacity of a channel c ∈ C, that is, the number
of messages that may be simultaneously sent through c. In the fat-trees
that we will consider, the channels are unidirectional and paired. That
is, for each channel going up
the  fat-tree  there  is  a  corresponding  channel  with  the  same  capacity  going 
down  the  fat-tree.  Each  processor  p  has  a  unique  address  in  the  fat-tree, 
l(p).  In  Figure  1,  for  example,  l(pi)  is  010.  We  assume  that  each  processor 
has  a  copy  of  its  own  address. 

Definition 1  A message set M ⊆ P × P is a set of messages where
(p1, p2) ∈ M is a message from processor p1 to processor p2.

Because the underlying structure of a fat-tree is a tree, message (p1, p2)
must traverse the unique path from p1 to p2 in FT.

Definition 2  Let load(M, c) be the number of messages in message set M
that must traverse channel c ∈ C.

Definition 3  M is called a one-cycle message set if for all c ∈ C,

    load(M, c) ≤ cap(c).

Because none of the channel capacities are exceeded, all of the messages
in a one-cycle message set can be delivered in one message delivery cycle.

Definition 4  The load factor, λ(M, c), of a channel c ∈ C due to a
message set M is

    λ(M, c) = load(M, c) / cap(c).

Definition 5  The load factor of FT due to M, λ(M), is

    λ(M) = max over c ∈ C of λ(M, c).
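
Definitions 2 through 5 can be illustrated with a short sketch. It assumes leaves numbered 0, 1, 2, ... in a complete binary tree and a capacity that depends only on the height of a channel; the channel naming, the function names, and the `cap` parameter are illustrative choices, not notation from the paper.

```python
from collections import Counter

def channels_on_path(src, dst):
    """Channels crossed by a message between two leaves.

    A channel is named (direction, level, node): `node` is the index, at
    height `level`, of the lower endpoint of the tree edge (level 0 = leaves).
    """
    channels = []
    level = 0
    while src != dst:
        channels.append(("up", level, src))     # channel leaving src's subtree
        channels.append(("down", level, dst))   # channel entering dst's subtree
        src, dst = src >> 1, dst >> 1            # move both ends to their parents
        level += 1
    return channels

def load_factor(messages, cap):
    """lambda(M): maximum over channels c of load(M, c) / cap(c)."""
    load = Counter()
    for src, dst in messages:
        load.update(channels_on_path(src, dst))
    return max((n / cap(level) for (_, level, _), n in load.items()),
               default=0.0)

M = [(0, 7), (1, 6)]
thin = load_factor(M, cap=lambda level: 1)           # every channel has capacity 1
fat = load_factor(M, cap=lambda level: 2 ** level)   # capacity doubles per level
```

With unit capacities the two messages collide above the leaves, so `thin` is 2.0; with capacities that double at each level `fat` is 1.0, which makes M a one-cycle message set in the sense of Definition 3.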
3  Routing Message Sets

Figure 2 shows an input graph G embedded in a fat-tree FT. In this figure,
the channel capacities of FT and the edge weights of G have been omitted
for clarity. Each vertex v ∈ G is assigned to its own processor, φ(v).
Where the context removes any ambiguity we will, for simplicity, let v
denote both φ(v) and the address of φ(v), l(φ(v)). Let v have neighbors
v1, v2, ..., vk in G. In processor φ(v) we store the adjacency list of v,
(v1, w1), (v2, w2), ..., (vk, wk), where w_i denotes the weight of the
edge connecting v and v_i. In this example, v1 has been embedded in
processor p1. In p1 we store the adjacency list (010, 1), (100, 2),
(110, 2), (111, 2).

Let G′ be the graph that results when each edge in G is replaced by
two oppositely directed edges. In general we will use the symbol ′ to
denote the operation of replacing each edge of an undirected graph with
two oppositely directed edges. Let M_G′ be the message set that arises
when each vertex in G′ sends a message to each of its neighbors. We will
use this primitive operation to determine the base load factor of the
input graph G. Leiserson and Greenberg have shown [LG] that an arbitrary
message set M can be broken up into one-cycle message sets on-line and
can, with high probability, be routed in O(λ(M) + log |P| log log |P|)
delivery cycles. This on-line routing algorithm assumes the existence of a
hardware mechanism to synchronize the sending of messages by the
processors in P. In a later section we will show how message set
synchronization can be accomplished with no dedicated hardware other than
increased channel capacities. Using either synchronization scheme, a
message set M will still, with high probability, be delivered in
O(λ(M) + log |P| log log |P|) delivery cycles.

The choice of a particular on-line message set routing algorithm is not
important to the understanding of the minimum spanning tree algorithm.
Throughout the paper we will assume that we have some mechanism for
synchronizing message sets and delivery cycles within those message sets,
and for deciding which messages belong in which delivery cycles. We assume
that each processor knows when the routing of a message set M begins and
ends, and when each delivery cycle used to route M begins and ends.
If we use the algorithm of Leiserson and Greenberg, we can send M_G′ in
O(λ(M_G′) + log |P| log log |P|) delivery cycles. We define β(G), the base
load factor of G, to be λ(M_G′) + log |P| log log |P|. We will analyze the
number of delivery cycles used by the algorithm in terms of β(G) and |V|.

In the literature, graph algorithms in which vertices are allowed to
communicate only with their neighbors are called "distributed". In such
algorithms a message from one vertex to another may have to pass through
O(|V|) intermediate vertices. Although we embed each vertex in its own
processor, our MSF algorithm is not in this sense distributed. We may pass
a message from v1 to v2 even when they are not neighbors in G.


4  The  Shortcut  Lemma 

Figure 3 shows a message set M0 in which processor p1 sends a message
to p2 and p2 sends a message to p3. The following lemma shows that we
may replace these two messages in M0 with a message directly from p1 to
p3 without increasing the load factor of M0.

Lemma 1  The Shortcut Lemma

Let p1, p2, and p3 be leaves of a fat-tree. Suppose p1 is sending a
message to p2 and p2 is sending a message to p3 in message set M0. Then
the load factor of the message set that results when these two messages
are replaced by a message directly from p1 to p3,

    M = (M0 ∪ {(p1, p3)}) - {(p1, p2), (p2, p3)},

is not greater than the load factor of the original message set M0. That
is,

    λ(M) ≤ λ(M0).

Proof: It will suffice to show that load(M, c) ≤ load(M0, c) for each
c ∈ C, since by definition λ(M0, c) = load(M0, c) / cap(c). Since the
underlying structure of FT is a tree, and a tree cannot contain any simple
cycles, the paths p1 → p2 and p2 → p3 must together contain the unique
simple path p1 → p3. Therefore message (p1, p3) passes through a channel c
only if either (p1, p2) or (p2, p3) does also. □

The Shortcut Lemma can be extended to show that we can replace any
subset of M0 that forms a path of messages from p1 to pN with a single
message directly from p1 to pN.

Corollary 1  Extended Shortcut Lemma

Let (p1, p2), (p2, p3), ..., (p_{N-1}, pN) ∈ M0. Suppose we replace
messages (p1, p2) through (p_{N-1}, pN) with a single message (p1, pN) in
message set M. That is, let

    M = (M0 ∪ {(p1, pN)}) - {(p1, p2), (p2, p3), ..., (p_{N-1}, pN)}.

Then the load factor of M will not be greater than the load factor of M0:

    λ(M) ≤ λ(M0).

Proof: The proof is by induction on N. □
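
The containment used in the proof of the Shortcut Lemma can be checked exhaustively on a small tree. The sketch below assumes 16 leaves numbered 0 to 15 in a complete binary tree and names each directed channel by its direction, height, and lower endpoint; these conventions and the function names are illustrative assumptions only.

```python
from itertools import permutations

def channels_on_path(src, dst):
    """Set of directed channels on the unique leaf-to-leaf path."""
    used = set()
    level = 0
    while src != dst:
        used.add(("up", level, src))
        used.add(("down", level, dst))
        src, dst, level = src >> 1, dst >> 1, level + 1
    return used

def shortcut_is_safe(p1, p2, p3):
    """True if every channel of p1 -> p3 lies on p1 -> p2 or on p2 -> p3."""
    detour = channels_on_path(p1, p2) | channels_on_path(p2, p3)
    return channels_on_path(p1, p3) <= detour

# Check every ordered triple of distinct leaves in a 16-leaf tree.
all_safe = all(shortcut_is_safe(*t) for t in permutations(range(16), 3))
```

Because the shortcut message uses only channels that the two replaced messages already used, no channel load can increase, which is the claim load(M, c) ≤ load(M0, c).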

In general, we will pass a message from v1 to v2 in the minimum spanning
forest algorithm only if there is a path in G′ from v1 to v2. Furthermore,
in any set of messages (v_i, v_j) that we send, there is some set of paths
from the v_i to the v_j such that no edge in E′ is traversed more than
once. In Figure 4 this paradigm has been violated. We can remove message
(p1, p2) to shortcut (p1, p2), (p2, p3), but cannot remove it again to
shortcut (p1, p2), (p2, p4). However, if we follow this paradigm then by
the Extended Shortcut Lemma the load factor of every message set will be
less than or equal to λ(M_G′), and therefore every message set will, with
high probability, be delivered in O(β(G)) delivery cycles.


5  Sollin’s  Algorithm 

Our minimum spanning forest algorithm is an implementation of the
following parallel algorithm attributed to Sollin in [GH]. We want to find
a set of edges, E_T, that forms a minimum spanning tree for each connected
component of G. At each stage of the algorithm, let T_1, T_2, ..., T_j
denote the subtrees formed by the edges in E_T.

Algorithm  1  Sollin’s  Algorithm 

Each vertex v_i is an isolated subtree T_i of G.

WHILE there are edges not in E_T connecting the T_i DO

Simultaneously select, for each subtree T_i, the edge of smallest
weight connecting a vertex u ∈ T_i with a vertex v ∈ T_j, i ≠ j. If
there is more than one edge with the same smallest weight, break
ties by giving each edge u-v a label k(e) = (max(u, v), min(u, v))
and choosing the edge with the smallest label. Add the selected
edge to E_T.

In each iteration of Sollin's algorithm we want to quickly gather
information about all of the edges adjacent to a subtree. In the following
section we explain how to build a "communication tree" for each subtree
T_i through which we can quickly gather this information. The technique we
use is reminiscent of the Euler Tour Technique introduced by Tarjan and
Vishkin in [TV]. We will compute the Euler tour of each subtree and from
this tour build a communication-efficient communication tree.
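
Algorithm 1 can be sketched sequentially as follows. This is only an illustration of the selection rule and the tie-breaking label, not the fat-tree implementation developed in the later sections; the function name `sollin_msf`, the edge format `(weight, u, v)`, and the relabelling of components are assumptions made for the example.

```python
def sollin_msf(num_vertices, edges):
    """Return the set of minimum spanning forest edges.

    `edges` is a list of (weight, u, v) triples over vertices 0..num_vertices-1.
    """
    component = list(range(num_vertices))   # component[v] = label of v's subtree

    def key(edge):
        w, u, v = edge
        return (w, max(u, v), min(u, v))    # weight first, then the label k(e)

    forest = set()
    while True:
        best = {}                           # lightest outgoing edge per subtree
        for edge in edges:
            _, u, v = edge
            cu, cv = component[u], component[v]
            if cu == cv:
                continue                    # edge lies inside one subtree
            for c in (cu, cv):
                if c not in best or key(edge) < key(best[c]):
                    best[c] = edge
        if not best:
            return forest                   # no edge connects two subtrees
        for w, u, v in set(best.values()):
            cu, cv = component[u], component[v]
            if cu != cv:                    # merge the two subtrees
                forest.add((w, u, v))
                component = [cu if c == cv else c for c in component]

edges = [(1, 0, 1), (2, 1, 2), (2, 2, 3), (3, 3, 0), (5, 4, 5)]
msf = sollin_msf(6, edges)
```

On this six-vertex graph with two connected components the first iteration already selects every forest edge, and the second iteration finds no connecting edge and stops.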

6  Communication  Trees 

6.1  Euler  Cycles  of  Trees 

Let T_i = (V_i, E_i) be a tree where V_i ⊆ V and E_i ⊆ E. Let
T_i′ = (V_i, E_i′) be the directed graph that results when each edge in
E_i is replaced with 2 oppositely directed edges in E_i′. Clearly T_i′ is
connected and the in-degree and out-degree of each vertex in T_i′ are
equal. An elementary result of graph theory is that in any connected
directed graph where for each vertex the in-degree and out-degree are
equal there exists a directed Euler cycle [E]. Let C_i′ be a directed
Euler cycle of T_i′.

A typical step in our minimum spanning forest algorithm is to merge
the directed Euler cycles of two vertex disjoint trees T_u = (V_u, E_u)
and T_v = (V_v, E_v) connected by an edge e between vertices u_i ∈ V_u and
v_j ∈ V_v to form the larger directed Euler cycle of
T_uv = (V_u ∪ V_v, E_u ∪ E_v ∪ {e}). Let e_ij = u_i → v_j and
e_ji = v_j → u_i be two oppositely directed edges between the endpoints
of e, and let C_u′ and C_v′ be directed Euler cycles of T_u′ and T_v′:

    C_u′ = u_0 → u_1 → ... → u_{i-1} → u_i → u_{i+1} → ... → u_0

    C_v′ = v_0 → v_1 → v_2 → ... → v_{j-1} → v_j → v_{j+1} → ... → v_0

Figure 5 illustrates the process of merging C_u′ and C_v′ to form the
larger directed Euler cycle of the tree
T_uv′ = (V_u ∪ V_v, E_u′ ∪ E_v′ ∪ {e_ij, e_ji}). Cycles C_u′ and C_v′ are
broken apart at vertices u_i and v_j, and then merged by connecting u_i
and v_j with edges e_ij and e_ji. The resulting cycle,

    C_uv′ = u_0 → ... → u_{i-1} → u_i → (e_ij) → v_j → v_{j+1} → ...
            → v_0 → ... → v_{j-1} → v_j → (e_ji) → u_i → u_{i+1} → ... → u_0

is a directed Euler cycle of T_uv′.
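
The splice described above can be sketched on an explicit list. The sketch assumes a directed Euler cycle is stored as a Python list of directed edges (tail, head), with the empty list standing for an isolated vertex; this list representation and the name `merge_euler_cycles` are illustrative assumptions and are not the distributed representation the paper uses later.

```python
def merge_euler_cycles(cu, cv, ui, vj):
    """Splice cycle cv into cycle cu through the new edges ui -> vj and vj -> ui."""
    # Break cu just after an edge that enters ui (position 0 if cu is empty).
    i = next((k + 1 for k, (_, head) in enumerate(cu) if head == ui), 0)
    # Rotate cv so that it starts with an edge leaving vj.
    j = next((k for k, (tail, _) in enumerate(cv) if tail == vj), 0)
    return cu[:i] + [(ui, vj)] + cv[j:] + cv[:j] + [(vj, ui)] + cu[i:]

def is_closed_walk(cycle):
    """True if consecutive edges chain head-to-tail and the walk closes up."""
    return all(cycle[k][1] == cycle[(k + 1) % len(cycle)][0]
               for k in range(len(cycle)))

c01 = merge_euler_cycles([], [], 0, 1)       # two isolated vertices 0 and 1
c012 = merge_euler_cycles(c01, [], 1, 2)     # attach isolated vertex 2 at vertex 1
```

Each merge adds exactly the two new directed edges, so a tree with n vertices ends up with a cycle of 2(n - 1) edges.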

Throughout the execution of the algorithm we will maintain a set of
subtrees of G, {T_1, T_2, ..., T_j}, and a set of directed Euler cycles of
those trees, {C_1′, C_2′, ..., C_j′}. Each vertex v ∈ V will belong to
exactly one tree T_i and to the Euler cycle C_i′ of T_i′. We will
repeatedly merge the T_i and their directed Euler cycles, the C_i′, by
adding connecting edges. When the algorithm terminates, the condition

    u, v ∈ V_i  <=>  u, v ∈ C_i′  <=>  u, v in same connected component of G

will be satisfied.


6.2  Building  Communication  Trees 

In describing the following communication tree construction algorithm, we
will consider only one of the subtrees of G, T_i. The algorithm, however,
runs simultaneously on all of the subtrees formed by the edges in E_T with
no communication or interference between them.

The communication tree construction algorithm requires a subroutine
that pairs vertices in a cycle. In a later section we will describe two
algorithms that perform this pairing, one deterministic, and the other
probabilistic.

We build a communication tree from the bottom up by repeatedly merging
subtrees of the communication tree to form larger subtrees. We keep
the roots of the subtrees in a cycle in the order that they appear in the
cycle C_i′ and pick pairs of them to merge. When the mergers are complete
we construct a new cycle of subtree roots by removing the vertices that
are no longer roots.

In Figure 6 we show how a pair of subtrees are merged. When we
merge two subtrees we want the inorder traversal of the resulting subtree
to be the same as the concatenation of the inorder traversals of the
merging subtrees. We do this by making the rightmost leaf of the left
subtree the root of the resulting subtree. The left and right roots of the
merging subtrees then become the left and right children of the new root.
In the inorder traversal of the final communication tree, the nodes appear
in the same order that they appear in C_i′.

In Figure 7 we show a cycle of subtree roots before the mergers have
taken place. The circled pairs of vertices are the roots of the subtrees
that will merge. In Figure 8 we show the cycle of subtree roots after the
mergers have occurred. In each merge, the rightmost leaf of the left
subtree has become the new root.

Algorithm  2  Communication  Tree  Construction  Algorithm 

Let R^i denote a cycle of the roots of the subtrees of communication
tree CT after i iterations of the communication tree construction
algorithm. Let T_j^i denote a subtree of CT with root r_j^i and rightmost
leaf l_j^i after i iterations of the algorithm.

Initially, i = 0 and R^0 = C′.

WHILE R^i contains more than one root DO

IF R^i contains only two subtree roots, r_j^i and r_k^i
THEN

Let the vertex with the smaller label, r_j^i or r_k^i, be
considered the vertex on the left.

ELSE

Use either the deterministic or the probabilistic pairing
algorithm to pair off roots in R^i.

For each ordered pair of roots (r_j^i, r_k^i)

1.  Let T_jk^{i+1} be the union of T_j^i and T_k^i.

2.  Remove from T_jk^{i+1} the edge connecting l_j^i to his father.

(a)  r_j^i sends l_j^i a message indicating that l_j^i is to become a
subtree root.

(b)  l_j^i sends his father a message indicating that he is no
longer a leaf.

3.  Make r_j^i the left child of l_j^i.

4.  Make r_k^i the right child of l_j^i.

(a)  r_j^i sends r_k^i a message containing the identity of l_j^i.

(b)  r_k^i sends l_j^i a message containing the identity of r_k^i.

5.  Let r_jk^{i+1} = l_j^i.

6.  Let l_jk^{i+1} = l_k^i.

(a)  r_k^i sends l_j^i a message containing the identity of l_k^i.

7.  Let R^{i+1} = R^i.

8.  Replace r_j^i and r_k^i in R^{i+1} with l_j^i.

(a)  r_k^i sends l_j^i a message containing the identity of r_k^i's
right neighbor.

(b)  r_j^i sends his left neighbor a message containing the
identity of l_j^i.

In Figure 9 a communication tree is constructed for a subtree whose
cycle C′ contains six vertex appearances, a through f. In the first
iteration, three pairs, (a, b), (c, d), and (e, f), are formed. The roots
on the left, a, c, and e, become the roots of the subtrees after the
mergers, and b, d, and f are removed from the cycle of subtree roots. In
the second iteration, c and e pair. Note how the rightmost leaf of c's
subtree, d, becomes the root of the subtree resulting from the merge.
Now only two subtree roots are left, a and d. Assuming that a has a
smaller label than d, the final subtree, the communication tree, has b as
its root. A quick look at the communication tree reveals that in its
inorder traversal, the vertices are visited in the same order that they
appear in C′.
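
The merge step and the Figure 9 walk-through can be replayed with a small sequential sketch. The `Node` class and the `merge` and `inorder` helpers are hypothetical names introduced for illustration, and the pairs are supplied by hand rather than by a pairing algorithm. The sketch also assumes that when the left subtree is a single vertex, that vertex is its own rightmost leaf and simply gains a right child, which is consistent with the first iteration described above.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.parent = self.left = self.right = None

def merge(left_root, right_root):
    """Merge two subtrees; the rightmost leaf of the left one becomes the root."""
    leaf = left_root
    while leaf.right is not None:          # walk down to the rightmost leaf
        leaf = leaf.right
    if leaf is not left_root:
        leaf.parent.right = None           # step 2: cut the leaf from his father
        leaf.left = left_root              # step 3: old left root is the left child
        left_root.parent = leaf
    leaf.parent = None
    leaf.right = right_root                # step 4: right root is the right child
    right_root.parent = leaf
    return leaf

def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.label] + inorder(node.right)

a, b, c, d, e, f = (Node(x) for x in "abcdef")
ra, rc, re = merge(a, b), merge(c, d), merge(e, f)   # first iteration
rd = merge(rc, re)                                    # second iteration: c and e
root = merge(ra, rd)                                  # final two roots, a on the left
```

After the last merge `root.label` is "b" and the inorder traversal reads a, b, c, d, e, f, matching the order of the appearances on the cycle.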

6.3  Communication Tree Broadcasting Algorithm

Our motivation for building a communication tree CT is to provide a way to
quickly gather information from and pass information to all of the
vertices in a subtree. We can use the following algorithm to broadcast a
message from the root of CT to all of the vertices in CT. The same
algorithm can be run in reverse, with each internal node forwarding to his
father only one of the messages that he receives from his children.
Algorithm 3  Communication Tree Broadcasting Algorithm


A message is passed down from the root to all of the nodes in
communication tree CT one level at a time. The message is first
simultaneously sent from the root of CT to his left child, and then to his
right child. When all of his children have received the message, they
forward it to their left children, and then to their right children. The
process is repeated until the leaves of CT have received the message.
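
The level-by-level schedule of Algorithm 3 can be simulated as follows. The sketch assumes the communication tree is given as a dictionary mapping each node to a (left child, right child) pair, and it counts one message set for the sends to left children and one for the sends to right children at each level; the dictionary format and the name `broadcast_schedule` are illustrative assumptions.

```python
def broadcast_schedule(children, root):
    """Map each node to the number of the message set that delivers the broadcast."""
    received = {root: 0}
    frontier = [root]                      # nodes that forward in this level
    step = 0
    while frontier:
        next_frontier = []
        for side in (0, 1):                # left children first, then right children
            step += 1
            for node in frontier:
                child = children.get(node, (None, None))[side]
                if child is not None:
                    received[child] = step
                    next_frontier.append(child)
        frontier = next_frontier
    return received

# Communication tree of Figure 9 as described in the text: root b.
tree = {"b": ("a", "d"), "d": ("c", "e"), "e": (None, "f")}
schedule = broadcast_schedule(tree, "b")
```

Each level of the tree costs two message sets, so a communication tree of depth d is covered after at most 2d message sets.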


7  Minimum  Spanning  Forest  Algorithm 

We can efficiently implement Sollin's algorithm by using a communication
tree to coordinate communication between the vertices in each subtree in
E_T. Recall that the principal step in Sollin's algorithm is to choose,
for each subtree T_i, the edge of smallest weight connecting a vertex
u ∈ T_i with a vertex v ∈ T_j, i ≠ j. In our implementation, each vertex
u ∈ T_i sends his neighbors in G the label of the root of CT_i. He then
chooses the lightest edge connecting him to a vertex in another
communication tree CT_j, breaking ties lexicographically as described
before. These potential merging edges are then passed up CT_i, which is
used as a decision tree, allowing only the lightest of the edges reaching
each internal node to progress up the tree (again we may have to break
ties lexicographically). The unique lightest edge reaching the root is
used to merge the two trees that it connects, T_i and T_j. If no edge
reaches the root, then T_i is a minimum spanning tree for a connected
component of G, and the vertices in that component may become idle.

After each merger occurs, a new communication tree CT is constructed
from C_ij′, the combination of cycles C_i′ and C_j′, and the root of CT
informs the vertices in T_ij of his identity.

Note that several trees may choose to merge with T_i, with possibly more
than one of them wishing to break C_i′ at a single vertex u. We will show
how to implement directed Euler cycles so that an arbitrary number of
cycles can be efficiently merged at u.


Figure 8: Cycle of Subtree Roots After Mergers

Figure 9: Building a Communication Tree

Algorithm  4  Minimum  Spanning  Forest 


E_T = ∅

Each vertex v_i is active and is an isolated subtree, T_i.

WHILE there are active vertices DO

1.  Each vertex u sends the label of the root of his communication
tree to his neighbors in G.

2.  Each vertex u determines which edge adjacent to a different tree
has the smallest weight. In the case of a tie, each edge u-v
is given a label k(e) = (max(u, v), min(u, v)) and the edge with
the smallest label is chosen.

3.  Each vertex u at the leaf of a communication tree passes the edge
of smallest weight that reaches him to his father in the
communication tree. The communication tree is used as a decision tree,
allowing only the smallest edge reaching each internal vertex to
pass.

4.  The root of each communication tree CT broadcasts the merging
edge to all of the vertices in CT. If no edge has reached the root,
then the minimum spanning tree of this connected component
has been found. In this case, the root of CT broadcasts a halt
instruction.

5.  For each merging edge u-v, u ∈ T_i, v ∈ T_j, merge C_i′ and C_j′.

6.  For each directed cycle, build a new communication tree.

7.  The new root of each communication tree broadcasts his label.

In Figures 10 through 13 we show a sample execution of the MSF
algorithm. As Figure 10 shows, the input graph G contains four vertices,
each of which has two neighbors. Initially each vertex v_i is an isolated
tree T_i, and the set of minimum spanning forest edges, E_T, is empty.

In Figure 11 we show the choices of merging edges made by the v_i in
the first iteration. Each vertex chooses the lightest edge connecting him
to a vertex in a different subtree. As the figure shows, v0 chooses e3, v1
chooses e0, v2 chooses e3, and v3 chooses e1. Note that v3 must break a
tie between edges e1 and e2 lexicographically.

At this point each vertex v_i is the sole vertex of a tree T_i, so the
edge that v_i picks becomes the merging edge for T_i. In Figure 12 we show
the cycle C′ constructed from edges e0, e1, and e3. In this example one
cycle contains all of the vertices in V. For clarity, we have chosen to
label the appearances of the v_i in C′ a, b, c, d, e, and f.

In Figure 13 we show the communication tree CT constructed from
the cycle C′. This tree is exactly that of Figure 9. At the end of the
first iteration, b sends all of the vertices in CT his label. Since all of
the vertices in V are contained in CT, in the second iteration no vertex
finds that he is adjacent to a vertex in any other communication tree, and
b broadcasts a halt instruction.


8  Implementation  of  Directed  Euler  Cycles 

Let T_u be a subtree of G with directed Euler cycle C_u′. We want to
maintain the cycle C_u′ by storing information at the vertices that appear
on C_u′. We also want to quickly merge two cycles, C_u′ and C_v′, by
adding two oppositely directed edges between a vertex u_i ∈ V_u and a
vertex v_j ∈ V_v. Thus for each edge e ∈ E_u′ adjacent to u_i, we store
out_{u_i}(e), the edge to traverse out from u_i after e is traversed in to
u_i. In order to merge two cycles C_u′ and C_v′ by adding edges e_ij and
e_ji with endpoints u_i ∈ V_u and v_j ∈ V_v, we pick an edge e′ ∈ E_u′
adjacent to u_i and an edge e″ ∈ E_v′ adjacent to v_j and perform the
following operations, in this order:

    out_{u_i}(e_ji) = out_{u_i}(e′)
    out_{u_i}(e′)   = e_ij
    out_{v_j}(e_ij) = out_{v_j}(e″)
    out_{v_j}(e″)   = e_ji

These operations are illustrated in Figure 14.
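
The four pointer updates can be tried out on a small example. The sketch assumes the out pointers of all vertices are kept in one Python dictionary keyed by the incoming directed edge (the head of that edge identifies the vertex that would store the pointer), and that e′ and e″ are edges entering u_i and v_j; the dictionary layout and the names `merge_at` and `walk` are illustrative assumptions.

```python
def merge_at(out, e_in_u, e_in_v):
    """Merge two cycles at ui and vj, given an edge entering each of them."""
    ui, vj = e_in_u[1], e_in_v[1]
    e_ij, e_ji = (ui, vj), (vj, ui)
    out[e_ji] = out[e_in_u]     # after returning to ui, resume the old u-cycle
    out[e_in_u] = e_ij          # leave ui along the new edge instead
    out[e_ij] = out[e_in_v]     # on arrival at vj, continue around the v-cycle
    out[e_in_v] = e_ji          # when the v-cycle returns to vj, go back to ui

def walk(out, start):
    """Follow out pointers from `start` until the cycle closes."""
    cycle, edge = [start], out[start]
    while edge != start:
        cycle.append(edge)
        edge = out[edge]
    return cycle

# Two one-edge trees, 0-1 and 2-3, each with a two-edge Euler cycle.
out = {(0, 1): (1, 0), (1, 0): (0, 1), (2, 3): (3, 2), (3, 2): (2, 3)}
merge_at(out, e_in_u=(0, 1), e_in_v=(3, 2))   # join vertex 1 to vertex 2
merged = walk(out, (0, 1))
```

Only a constant number of pointers change, and each change is local to u_i or v_j, which is what allows several cycles to be merged at the same vertex one after another.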
Figure 12: Directed Euler Cycle

Figure 13: Communication Tree

9  Pairing Algorithms

Let C_i′ be the directed Euler cycle of tree T_i. A vertex v may appear
more than once in such a cycle. Suppose vertex v_i appears after vertex
v_{i-1} in C_i′. We can give this appearance of v_i the label
k(v_i) = (v_{i-1}, v_i). No other appearance can have the same label,
because the cycle is Eulerian. For simplicity, we shall speak of each of
these uniquely labelled appearances of v as if they were different
vertices in C_i′. We may use either of the following algorithms to pair
the vertices in C_i′.

9.1  Deterministic  Pairing  Algorithm 

Algorithm  5  Deterministic  Pairing 

Let k(v_i)_j be bit j of the binary encoding of k(v_i).

Each vertex v_i becomes active. For each active vertex v_i, perform the
following operations synchronously.

FOR l := low order bit position of k(v) to high order bit position of
k(v) DO

IF k(v_i)_l = 1 and k(v_{i+1})_l = 0

THEN

Let u = v_{i+1}.

Send a message to u.

IF k(v_i)_l = 0 and k(v_{i-1})_l = 1
THEN

Let u = v_{i-1}.

Send a message to u.

IF v_i received a message from a vertex u
THEN

v_i pairs with u and becomes inactive.

Active vertices repeat the process above for 0 → 1 transitions.


In this algorithm we alternate between sending message sets consisting only of
messages to right neighbors in $C'$ and message sets consisting only of messages
to left neighbors. At no point do we intersperse messages to left and right neighbors.
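The following Python sketch simulates the deterministic pairing rule on a single cycle. It is an illustration under stated assumptions, not the paper's implementation: labels are assumed to be distinct non-negative integers (standing in for the binary encodings of the appearance labels), list position `i` stands for $v_i$ with indices taken modulo the cycle length, and the function name `deterministic_pairing` is hypothetical.

```python
def deterministic_pairing(labels):
    """Return partner[i] (an index or None) for each position on the cycle.

    Pass 1 scans bit positions from low to high and pairs neighbors
    across 1 -> 0 transitions; pass 2 repeats the scan for 0 -> 1
    transitions. Only vertices that are still unpaired take part.
    """
    n = len(labels)
    partner = [None] * n
    width = max(labels).bit_length() if labels else 0
    for hi, lo in ((1, 0), (0, 1)):
        for bit in range(width):
            new_pairs = []
            for i in range(n):
                j = (i + 1) % n  # right neighbor of v_i on the cycle
                if i == j or partner[i] is not None or partner[j] is not None:
                    continue
                if (labels[i] >> bit) & 1 == hi and (labels[j] >> bit) & 1 == lo:
                    new_pairs.append((i, j))
            # Apply all pairings of this synchronous step at once.
            for i, j in new_pairs:
                partner[i], partner[j] = j, i
    return partner


partner = deterministic_pairing([5, 3, 6, 1, 4, 2])
```

Within one step a vertex cannot be claimed twice, because pairing to the right requires its bit to equal `hi` while pairing to the left requires it to equal `lo`.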

In Figures 15 and 16 we show the execution of the deterministic pairing
algorithm on a cycle of length 6. Recall that the same vertex may appear
more than once on a cycle, but that each appearance is given a unique
label. In Figure 15 the binary representation of the unique label of each
vertex appearance is listed below that appearance.

As Figure 16 shows, two pairs are formed when 1 → 0 transitions are
examined. The first such transition occurs in the low order bit position,
from appearance (3,1) to appearance (1,0). The second pair is found when
the transition between appearance (0,2) and appearance (2,0) in the next
bit position is examined. A third pair is found by examining the 0 → 1
transitions. Note that several 1 → 0 and 0 → 1 transitions are ignored
because, by the time they are examined, one or both of the vertices
involved have already paired.


9.2  Probabilistic  Pairing  Algorithm 

As we will later show, the deterministic pairing algorithm has the disadvantage
that it requires $O(\log |V|)$ iterations in which very sparse message
sets are generated. While the following probabilistic pairing algorithm is
not guaranteed to pair a constant fraction of the vertices in cycle $C'$, we
will show that, using the probabilistic pairing algorithm, the communication
tree construction algorithm will, with high probability, take the same order
of iterations as when the deterministic algorithm is used. The algorithm
assumes that a processor has the capability of making a random choice.
Recall that this capability is already required by our probabilistic on-line
routing algorithm.

Algorithm 6 Probabilistic Pairing

Let $v_i$ appear on cycle $C'$ between vertices $v_{i-1}$ and $v_{i+1}$.

For each vertex $v_i$, perform the following operations synchronously.

Randomly pick v_{i-1} or v_{i+1} to send a message to. Call this vertex u.

IF u = v_{i+1}
THEN
    Send a message to u.

IF u = v_{i-1}
THEN
    Send a message to u.

IF v_i received a message from u
THEN
    v_i pairs with u and becomes inactive.
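One round of the probabilistic rule can be sketched as follows. This is a minimal illustration, assuming that the cycle is represented by positions `0..n-1` taken modulo `n`, that the random choice is a fair coin drawn from a caller-supplied `random.Random` instance, and that the name `probabilistic_pairing` is hypothetical.

```python
import random


def probabilistic_pairing(n, rng):
    """Run one synchronous round on a cycle of n vertices.

    Every vertex picks its left or right neighbor with probability 1/2
    and sends it a message. Two vertices pair exactly when each one
    received a message from the vertex it chose.
    """
    choice = [(i + 1) % n if rng.random() < 0.5 else (i - 1) % n
              for i in range(n)]
    partner = [None] * n
    for i in range(n):
        j = choice[i]
        if j != i and choice[j] == i:  # mutual choice between neighbors
            partner[i] = j
    return partner


partner = probabilistic_pairing(8, random.Random(0))
```

Because a pair forms only on a mutual choice, no vertex can end up in two pairs, so no conflict resolution step is needed.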

10  Analysis 

10.1  Deterministic  Pairing 

Lemma 2 The deterministic pairing algorithm pairs at least 2/3 of the
vertices in the cycle $C'$.

Proof: The label $k(v_i)$ of every vertex must differ from each of $k(v_{i-1})$
and $k(v_{i+1})$ in at least one bit position $l$. If the transition from $k(v_{i-1})_l$
to $k(v_i)_l$ is 1 → 0, then either there is a 0 → 1 transition from $k(v_i)_l$ to
$k(v_{i+1})_l$ or $k(v_i)_l = k(v_{i+1})_l$. Since we treat 1 → 0 and 0 → 1 transitions
in separate iteration loops, every bit difference between the labels of two
neighbors is considered individually.

Assume that two neighbors are both unpaired when the algorithm terminates.
These two neighbors must differ in at least one bit position. Since
this difference was considered in some iteration, they should have paired
in that iteration. Therefore, no two neighbors can both be unpaired when the
algorithm terminates. At the end of the algorithm, then, a vertex can be
unpaired only if both of his neighbors paired with other vertices. Every
unpaired vertex is thus separated from the next unpaired vertex by at least
one pair, so at least 2/3 of the vertices in $C'$ are paired when the
algorithm terminates. □

Lemma 3 The deterministic pairing algorithm generates $O(\log |V|)$ message
sets, each of which can be delivered in $O(\beta(G))$ delivery cycles.

Proof: We can encode $k(v)$ in $O(\log |V|)$ bits. The algorithm, therefore,
may perform $O(\log |V|)$ iterations.

The algorithm operates on all cycles simultaneously. In each iteration,
two separate message sets are generated, one of messages to right neighbors,
$M_R$, and one of messages to left neighbors, $M_L$. The messages in one of
these message sets travel in the direction of the edges in the cycles while
the messages in the other travel in the opposite direction.

For every edge $e \in E_t$ between vertices $v_i$ and $v_j$, directed edges $e_{ij} \in E'$
and $e_{ji} \in E'$ appear once in the cycles of the minimum spanning forest
subtrees. Since we are assuming that the capacities of corresponding up
and down channels in our fat-tree are equal, traversing a directed edge $e_{ij}$
in reverse is equivalent to traversing $e_{ji}$, and vice versa. Thus each of the
message sets $M_R$ and $M_L$ traverses a subset of the edges in $E'$ and can be
delivered in $O(\beta(G))$ delivery cycles. □

Lemma 4 Using the deterministic pairing algorithm, the communication
tree construction algorithm produces trees of height $O(\log |V|)$.

Proof: In each iteration, at least 2/3 of the remaining subtree roots will be
paired by the pairing algorithm. In each merger both subtree roots become internal
nodes, and the rightmost leaf of the left subtree becomes a
subtree root. Therefore, the number of subtree roots is reduced by at least 1/3 in
each iteration. After $O(\log |V|)$ steps, only one root will remain.

As the height of a subtree grows by at most one in each iteration, each
resulting communication tree has height $O(\log |V|)$. □
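The counting step in this proof can be written out with the pairing guarantee left as a generic constant. The symbols $c$ (the fraction of subtree roots removed per iteration, $0 < c < 1$) and $n_t$ (the number of subtree roots after $t$ iterations) are introduced here only for illustration.

```latex
n_0 \le |V|, \qquad n_{t+1} \le (1 - c)\, n_t
\;\Longrightarrow\;
n_t \le (1 - c)^t \, |V| ,
\qquad\text{so}\qquad
n_t \le 1 \ \text{ once } \
t \ge \frac{\log |V|}{\log \frac{1}{1 - c}} = O(\log |V|).
```

Any constant $c$ gives the same asymptotic bound; only the constant hidden in the $O(\log |V|)$ changes.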

10.2  Probabilistic  Pairing 

We would like to show that, with high probability, the communication trees
constructed using the probabilistic pairing algorithm will have the same
height as those constructed using the deterministic algorithm.

Whenever two subtrees merge into one subtree, we define the right subtree
to be the one that merges and the left subtree to be the one that
remains in the cycle.

Consider subtree roots $r_i \in T_i$ and $r_{i+1} \in T_{i+1}$ below. When $T_i$ and $T_{i+1}$
merge, $T_i$ is the tree that remains in the cycle.

$$\rightarrow r_i \rightarrow r_{i+1} \rightarrow$$

Lemma 5 In any iteration of the CT construction algorithm, a subtree has
probability 3/4 of remaining in the cycle.

Proof: We are assuming that each vertex chooses to pair with either his
left or right neighbor in the cycle with equal probability and independently
of the choice of any other vertex. A subtree merges in a given round if
the root of that subtree chooses his right neighbor, and his right neighbor
chooses him. This probability is computed below.

$$
\Pr(v_i \text{ chooses } v_{i+1} \text{ and } v_{i+1} \text{ chooses } v_i)
= \Pr(v_i \text{ chooses } v_{i+1}) \, \Pr(v_{i+1} \text{ chooses } v_i)
= \frac{1}{2} \cdot \frac{1}{2}
= \frac{1}{4}
$$

A subtree remains in the cycle whenever it does not merge. This probability
is computed below.

$$
\Pr(T_{i+1} \text{ remains a subtree})
= 1 - \Pr(v_i \text{ chooses } v_{i+1} \text{ and } v_{i+1} \text{ chooses } v_i)
= \frac{3}{4}
$$

□
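The two probabilities can be checked by exhaustive enumeration. The sketch below assumes, as the proof does, that the two neighbors choose independently and uniformly; the function name `merge_probability` and the `"left"`/`"right"` encoding are hypothetical.

```python
from fractions import Fraction
from itertools import product


def merge_probability():
    """Probability that two adjacent vertices choose each other.

    Each outcome is (choice of v_i, choice of v_{i+1}); the pair merges
    only when v_i looks right and v_{i+1} looks left.
    """
    outcomes = list(product(("left", "right"), repeat=2))
    mutual = [o for o in outcomes if o == ("right", "left")]
    return Fraction(len(mutual), len(outcomes))


p_merge = merge_probability()
p_remain = 1 - p_merge
```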

Lemma 6 The probability that a subtree $T_i$ will remain in the cycle of subtree
roots after $m$ rounds of the communication tree construction algorithm
is $(3/4)^m$.

Proof: We are assuming that the choices made in each iteration are independent
of the choices made in any other iteration. Let $M_i^j$ be the event
that subtree $T_i$ merges in round $j$. The probability that $T_i$ will merge in
round $m$ is simply 1/4 times the probability that $T_i$ did not merge in any of the
previous $m - 1$ rounds. That is,

$$P(M_i^m) = \frac{1}{4} \left(\frac{3}{4}\right)^{m-1}$$

Let $M_i$ be the event that subtree $T_i$ merges in one of the first $m$ iterations.
This probability is the sum of a geometric series:

$$P(M_i) = \sum_{j=1}^{m} \frac{1}{4} \left(\frac{3}{4}\right)^{j-1} = 1 - \left(\frac{3}{4}\right)^m$$

A subtree remains in the cycle after $m$ iterations when he does not merge
in any of the first $m$ iterations. Let $R_i$ be the event that subtree $T_i$ remains
in the cycle after $m$ iterations. The probability that a subtree does not
merge in any of the first $m$ iterations is expressed below.

$$P(R_i) = 1 - P(M_i) = \left(\frac{3}{4}\right)^m$$

□
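The geometric series can be verified with exact rational arithmetic. This is a small check, assuming the per-round merge probability of 1/4 from Lemma 5; the names `merge_by_round` and `remain_after` are hypothetical.

```python
from fractions import Fraction


def merge_by_round(m):
    """Probability of merging in one of the first m rounds.

    Sums (1/4) * (3/4)**(j - 1) over rounds j = 1..m.
    """
    p, q = Fraction(1, 4), Fraction(3, 4)
    return sum(p * q ** (j - 1) for j in range(1, m + 1))


def remain_after(m):
    """Probability of still being in the cycle after m rounds."""
    return 1 - merge_by_round(m)


survival = [remain_after(m) for m in range(4)]
```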

Lemma 7 The probability that the communication tree construction algorithm
will build any communication tree of height greater than
$k \log |V|$ is $O(1/|V|)$ for a sufficiently large constant $k$.

Proof: As in the previous lemma, let $R_i$ be the event that subtree $T_i$ remains
in the cycle of subtree roots after $m$ iterations. An elementary theorem of
probability is that the probability of the union of one or more events is less
than or equal to the sum of the individual probabilities of those events.
Applying this relation to the $R_i$, we have

$$
\Pr(R_1 \cup R_2 \cup \dots \cup R_{|V|}) \le P(R_1) + P(R_2) + \dots + P(R_{|V|}) \le |V| \left(\frac{3}{4}\right)^m
$$

For $m = k \log |V|$ iterations, we have

$$
P(R_1 \cup R_2 \cup \dots \cup R_{|V|})
\le |V| \left(\frac{3}{4}\right)^{k \log |V|}
= |V| \cdot |V|^{-k \log (4/3)}
= |V|^{1 - k \log (4/3)}
$$

which is at most $1/|V|$ once $k \ge 2 / \log(4/3)$.

□
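To get a feel for the constants, the union bound $|V| (3/4)^m$ can be evaluated exactly. The sketch below is an illustration only: the function name `rounds_for_union_bound` and the idea of searching for the first round count that meets a target failure probability are assumptions, not part of the analysis.

```python
from fractions import Fraction


def rounds_for_union_bound(num_subtrees, target):
    """Smallest m with num_subtrees * (3/4)**m <= target.

    Uses exact fractions so that no floating point rounding can move
    the threshold by one round.
    """
    bound = Fraction(num_subtrees)
    m = 0
    while bound > target:
        bound *= Fraction(3, 4)
        m += 1
    return m


m_small = rounds_for_union_bound(4, Fraction(1))
m_large = rounds_for_union_bound(1024, Fraction(1, 1024))
```

The number of rounds grows by a constant each time the number of subtrees doubles, which is the logarithmic behaviour the lemma relies on.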

10.3  CT  Construction 

Lemma 8 Using the deterministic pairing algorithm, the communication
tree construction algorithm generates $O(\log^2 |V|)$ message sets, each of which
can be delivered in $O(\beta(G))$ delivery cycles.

Proof: By lemma 4, the communication tree construction algorithm performs
$O(\log |V|)$ iterations, in each of which the deterministic pairing algorithm
generates $O(\log |V|)$ message sets. By lemma 3, each of these message
sets can be delivered in $O(\beta(G))$ delivery cycles.

In addition to the message sets generated by the pairing algorithm, the
communication tree construction algorithm generates a constant number of
message sets in replacing the left and right subtree roots of each pair with
the rightmost leaf of the left subtree. However, by an argument analogous
to that of lemma 3, each of these message sets can be delivered in $O(\beta(G))$
delivery cycles. □

Lemma 9 Using the probabilistic pairing algorithm, the communication
tree construction algorithm generates $O(\log |V|)$ message sets, each of which
can be delivered in $O(\beta(G))$ delivery cycles.

Proof: The proof is analogous to the proof of lemma 8. We use the fact
that, by an argument similar to that of lemma 3, the probabilistic pairing
algorithm generates 2 message sets, each of which can be delivered in
$O(\beta(G))$ delivery cycles. □
10.4 Communication Tree Broadcasting

Definition 6 The projection of an edge $u$ — $v$ in a communication tree
$CT$, where $u$ is the father of $v$, is the path $u \rightarrow v$ in the directed cycle $C'$
when $v$ is a right child, and the path $v \rightarrow u$ when $v$ is a left child.

Lemma 10 The communication tree broadcasting algorithm generates $O(\log |V|)$
message sets, each of which can be delivered in $O(\beta(G))$ delivery cycles.

Proof: Consider the set of father to right child edges at one level of a
communication tree $CT_i$. The projections of these edges are all paths from
the fathers to their right children. In an inorder traversal of $CT_i$ we visit
the endpoints of these edges from left to right, always visiting a father and
his right child consecutively. As these endpoints appear in $C'_i$ in the same
order that they appear in the traversal, the projections of these edges are
disjoint. As in lemma 3, we can shortcut these disjoint projections in all
communication trees without increasing the load factor of the message set. A similar
argument holds for father to left child edges. Thus we can send the left to
right message set and the right to left message set at each level in $O(\beta(G))$
delivery cycles.

$CT_i$ has height $O(\log |V|)$, so the communication tree broadcasting
algorithm generates $O(\log |V|)$ message sets, each of which can be delivered
in $O(\beta(G))$ delivery cycles. □
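The level-by-level structure of the broadcast can be sketched as follows. This is an illustration under assumptions not fixed by the text: the communication tree is a binary tree of `Node` objects, each level contributes one message set for left children and one for right children (possibly empty), and the names `Node` and `broadcast_message_sets` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def broadcast_message_sets(root):
    """Return the message sets of a root-to-leaves broadcast.

    Each message is a (father, child) pair. Messages to left children
    and messages to right children are never mixed in one set.
    """
    sets = []
    level = [root]
    while level:
        to_left = [(p.name, p.left.name) for p in level if p.left]
        to_right = [(p.name, p.right.name) for p in level if p.right]
        if not to_left and not to_right:
            break  # this level consists of leaves only
        sets.append(to_left)
        sets.append(to_right)
        level = [c for p in level for c in (p.left, p.right) if c]
    return sets


tree = Node("r", left=Node("a", right=Node("c")), right=Node("b"))
message_sets = broadcast_message_sets(tree)
```

For a tree of height $h$ the sketch produces $2h$ message sets, which matches the two sets per level counted in the proof.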

10.5  Minimum  Spanning  Forest 

Lemma 11 Using the deterministic pairing algorithm, the minimum spanning
forest algorithm generates $O(\log^3 |V|)$ message sets, each of which can
be delivered in $O(\beta(G))$ delivery cycles.

Proof: Sollin's algorithm performs $O(\log |V|)$ iterations [GH]. Below is an
analysis of the steps that occur in each iteration of the algorithm.

1. By definition, the message set generated in step 1 can be delivered in
$O(\beta(G))$ delivery cycles.

2. No messages are generated in step 2.

3. By lemma 10, step 3 generates $O(\log |V|)$ message sets, each of which
can be delivered in $O(\beta(G))$ delivery cycles.

4. See step 3.

5. See step 1.

6. By lemma 8, step 6 generates $O(\log^2 |V|)$ message sets, each of which
can be delivered in $O(\beta(G))$ delivery cycles.

7. See step 3.

Each iteration therefore generates $O(\log^2 |V|)$ message sets, and the
$O(\log |V|)$ iterations together generate $O(\log^3 |V|)$ message sets.

□
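The tally behind the bound can be written out step by step. The grouping below simply restates the seven per-step counts from the list above; no new quantities are introduced.

```latex
\underbrace{O(1)}_{\text{steps 1, 5}}
+ \underbrace{0}_{\text{step 2}}
+ \underbrace{O(\log |V|)}_{\text{steps 3, 4, 7}}
+ \underbrace{O(\log^2 |V|)}_{\text{step 6}}
= O(\log^2 |V|) \ \text{message sets per iteration},
\qquad
O(\log |V|) \cdot O(\log^2 |V|) = O(\log^3 |V|) \ \text{in total}.
```

Step 6, the communication tree construction, dominates every iteration, which is why replacing the pairing algorithm changes the overall bound.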

Lemma 12 Using the probabilistic pairing algorithm, the minimum spanning
forest algorithm generates $O(\log^2 |V|)$ message sets, each of which can
be delivered in $O(\beta(G))$ delivery cycles.

Proof: The proof is analogous to that of lemma 11, except that by lemma 9
step 6 generates $O(\log |V|)$ message sets, each of which can be routed in
$O(\beta(G))$ delivery cycles. □

11  Synchronizing  Message  Sets 

We  have  previously  assumed  a  hardware  synchronization  mechanism  that 
lets  each  processor  know  when  the  routing  of  each  message  set  is  to  start 
and  end,  and  when  each  delivery  cycle  of  that  message  set  is  to  start  and 
end.  With  fixed  length  messages,  all  delivery  cycles  require  the  same  fixed 
number  of  clock  cycles.  Thus  if  a  processor  knows  when  a  message  set  is  to 
start,  he  can  keep  synchronized  with  each  delivery  cycle  by  keeping  a  local 
counter. 

We would like to remove the need for a hardware message set synchronization
mechanism. The difficulty is that the exact number of delivery
cycles needed to route a message set is not known until all of the messages
in that set have reached their destinations. Furthermore, different
processors will finish sending their messages in different delivery cycles.
However, by computing off-line a communication tree that contains all of
the processors in the fat-tree, we can provide a message set synchronization
mechanism with no dedicated hardware other than an increase in each
channel capacity of 1.

In Figure 17 we show the cycle from which we will build the synchronization
communication tree. This cycle contains the processors in FT in
the order that they are visited in an inorder traversal of FT. Figure 18
shows that the load of each channel due to this cycle is at most 1. Figure 19
shows the process of building the synchronization communication
tree. This communication tree is computed once off-line and never changes.

Using the communication tree illustrated in Figure 19, the synchronization
algorithm is as follows. To start a message set, processor $p_3$ broadcasts
a start signal down the communication tree. In this case each of the $2 \log_2 |P|$
message sets generated by the communication tree broadcasting algorithm
will require exactly one delivery cycle. When a processor receives a start signal,
he begins sending his messages in $M$ in the next delivery cycle. When
a processor has finished sending all of his messages, he waits for each of
his children in the synchronization communication tree to send him a message
confirming that they have finished sending their messages, and then
forwards the message to his father in the synchronization communication
tree.
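The completion half of this protocol is a convergecast up the synchronization tree, and can be sketched as below. This is a simplified model under assumptions that go beyond the text: each confirmation takes exactly one delivery cycle per tree edge, `finish[p]` is the delivery cycle in which processor `p` finishes sending its own messages, and the names `completion_cycle`, `children`, and `finish` are hypothetical.

```python
def completion_cycle(children, finish, root):
    """Delivery cycle in which `root` knows its whole subtree is done.

    A processor is ready once it has finished its own messages and has
    heard from every child; a child's confirmation arrives one delivery
    cycle after the child became ready.
    """
    ready = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not expanded:
            # Revisit the node after all of its children are processed.
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
        else:
            ready[node] = max([finish[node]] + [ready[k] + 1 for k in kids])
    return ready[root]


children = {"r": ["a", "b"], "a": ["c"]}
finish = {"r": 2, "a": 1, "b": 5, "c": 4}
root_ready = completion_cycle(children, finish, "r")
```

In this model the overhead beyond the slowest processor is bounded by the height of the synchronization tree.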

The synchronization messages will not reduce the probability of any
message in $M$ being successfully transmitted, for although at each concentrator
switch the number of messages arriving may have increased by 1, the
capacity has also increased by 1.

For $load(c) > cap(c)$ we have the following relation:

$$\frac{load(c) + 1}{cap(c) + 1} < \frac{load(c)}{cap(c)} = \lambda(c)$$

Thus we have actually decreased the load factor of each channel.
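The inequality behind this claim follows from cross-multiplying. The derivation below assumes that the relation compares the ratio of load to capacity after adding one message and one unit of capacity with the ratio before, and that both capacities are positive.

```latex
\frac{load(c) + 1}{cap(c) + 1} < \frac{load(c)}{cap(c)}
\iff cap(c)\,\bigl(load(c) + 1\bigr) < load(c)\,\bigl(cap(c) + 1\bigr)
\iff cap(c) < load(c).
```

So the ratio strictly decreases exactly on the overloaded channels, where $load(c) > cap(c)$.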

These synchronization messages may add an additional $\log |P| + \log \log |P|$
delivery cycles to the number of cycles needed to route $M$, for example in
the case when all processors finish sending their messages in $M$ in the same
delivery cycle.
Figure 19: Synchronization Communication Tree

12  Comments 

The algorithms described in the previous sections are all SIMD (Single
Instruction, Multiple Data) in nature. In each instruction cycle, every processor
executes the same instruction. Processors behave differently when
they operate on different data. We chose to design our algorithm using the
SIMD paradigm only because it is conceptually simpler than the MIMD
(Multiple Instruction, Multiple Data) paradigm. We do not mean to imply
that parallel algorithms should, in general, use the SIMD paradigm.
Similarly, we chose a very simple message passing protocol. The only
interaction between a sending processor and a receiving processor is a final
acknowledgement. More complicated mechanisms can be realized with essentially
the same hardware. For example, instead of passing a message to
the receiving processor, the sending processor might send a request to read
some portion of the receiving processor's memory. The receiving processor
would then reply with that data instead of sending a simple acknowledgement.
There may be profound reasons for choosing the SIMD or the MIMD
paradigm, or for using some particular message sending protocol, but we
have not dealt with these issues in this paper.

In this paper we have examined a technique for keeping communication
costs down throughout a parallel algorithm. Our technique is to construct
"communication trees" from cycles of processors. If we think of each cycle of
processors as a set of processors, then we can imagine using communication
trees to implement a variety of basic set operations. Our current algorithms
for even such simple operations as computing the union of two sets are very
expensive. We must discard the communication tree of each set and build
a completely new communication tree. We expect that future research will
explore such problems as merging two communication trees directly, and
computing the most efficient communication tree for a set of processors.

Finally,  the  message  set  routing  results  of  Leiserson  and  Greenberg  [L, 
LG]  show  that  no  matter  how  large  the  load  factor  of  a  message  set,  we  can, 
for  a  given  amount  of  hardware,  deliver  it  in  almost  optimal  time.  Thus  if 
a  problem  takes  a  long  time  to  run  on  a  fat-tree,  then  it  will  take  a  long 
time  to  run  on  any  architecture.  These  observations  lead  to  the  somewhat 
obvious  conclusion  that  we  should  examine  those  problems  for  which  we 
only  need  to  generate  message  sets  with  small  load  factors. 